Introduction

The document contains a detailed analysis of forecast sale expectations of High-End Vacuums for Vac-Attack company in December 2020 using historical data. This analysis aims to determine if the company will meet the target and ensure the stock is on hand to match the demand.

The report describes data cleaning, summary statistics, visualisation and analysis using the statistical model fitting, implemented with R programming language.

Loading and installing the required R packages

if (!require(tidyverse))
  install.packages("tidyverse")

if (!require(lubridate))
  install.packages("lubridate")

if (!require(plotly))
  install.packages("plotly")

if (!require(car))
  install.packages("stargazer")

if (!require(xgboost))
  install.packages("xgboost")

library(tidyverse)
library(lubridate)
library(stargazer)
library(xgboost)

Importing Data

First import the Market data.

market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)

market_sale_raw <- readr::read_csv("Data/MarketingSales.csv", 
                                   col_names = market_sale_col$X1)

Then import the December data for prediction.

dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)

dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv", 
                                   col_names = dec_col$X1)

Quick explore of the read-in the data sets

dim(market_sale_raw)
## [1] 1796   11
glimpse(market_sale_raw)
## Rows: 1,796
## Columns: 11
## $ Date                   <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month                  <chr> "January", "January", "January", "January", "Ja…
## $ Day                    <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247`         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales                  <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)
## [1] 31  4
glimpse(dec_ad_raw)
## Rows: 31
## Columns: 4
## $ Date             <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month            <chr> "December", "December", "December", "December", "Dece…
## $ Day              <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…

Tidy up the variables of both datasets

The next step is to tidy up the data sets for visualisation and modelling. I have included year, month and day variables in both factor and numeric formats.

market_sale_final <- 
  market_sale_raw %>% 
  mutate(Date = dmy(Date),
         Ad_Spend =  AdvertisingSpend,
         Phone_24 = factor(`0508Line_247`),
         Positive_News = factor(PositiveNews),
         Negative_News = factor(NegativeCoverage),
         Competition = factor(Competition),
         Ultra_Edition = factor( UltraEdition_Available), 
         COVID_Lockdown = factor(COVID_Lockdown),
         Year = factor(year(Date)),
         Month = factor(Month, levels = month.name),
         Day = factor(Day, levels =
                        c("Monday", "Tuesday", "Wednesday", "Thursday",
                          "Friday", "Saturday", "Sunday")),
         Month_num = as.numeric(Month),
         Year_num = year(Date),
         Day_num = as.numeric(Day),
         Date_num = date(Date)) %>% 
  select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, COVID_Lockdown,
         Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(market_sale_final)
## [1] 1796   16
glimpse(market_sale_final)
## Rows: 1,796
## Columns: 16
## $ Sales          <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date           <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month          <fct> January, January, January, January, January, January, J…
## $ Day            <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num        <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num       <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num       <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…

The following dataset is used to predict the total units sold in December 2020. Thus, this dataset contains the advertising spent without the expectation of competition or news stories. However, the Ultra Vac will continue to be sold.

dec_ad_final <-
  dec_ad_raw %>%
  mutate(
    Date = dmy(Date),
    Ad_Spend =  AdvertisingSpend,
    Phone_24 = factor(0, levels = c(0, 1)),
    Positive_News = factor(0, levels = c(0, 1)),
    Negative_News = factor(0, levels = c(0, 1)),
    Competition  = factor(0, levels = c(0, 1)),
    Ultra_Edition = factor(1, levels = c(0, 1)),
    COVID_Lockdown = factor(0, levels = c(0, 1)),
    Year = factor(2020, levels = market_sale_final$Year |> levels() ),
    Month = factor(Month, levels = month.name),
    Day = factor(Day, levels =
                   c("Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday")),
    Year_num = 2020,
    Month_num = 12,
    Day_num = as.numeric(Day),
    Date_num = date(Date)
  ) %>% 
  select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, 
         COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(dec_ad_final)
## [1] 31 15
glimpse(dec_ad_final)
## Rows: 31
## Columns: 15
## $ Ad_Spend       <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date           <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month          <fct> December, December, December, December, December, Decem…
## $ Day            <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num        <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num       <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num      <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num       <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…

Data exploration

Data visulisation

The first step is to generate some plots to look for any relationships and outliers.

The first plot is a line plot on total unit sold versus time.

ggplot(market_sale_final, aes(x = Date, y = Sales)) +
  geom_path() +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()

There is clearing some seasonality on the total unit sold versus time for adjustment, thus, Year and Month variables might be important variables for the modelling fitting.

The second plot examines the relationship between the total unit sold versus the total advertising spend.

ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
  geom_point()

There is a weak positive correlation between the Advertising Spend and total unit sold, with the Pearson correlation of 41.1%.

The plot is to see if the Advertising Spend logarithm transform can improve the linear relationship to the total unit sold.

ggplot(market_sale_final, aes(x = log(Ad_Spend +1), y = Sales)) +
  geom_point()

The correlation between the Advertising Spend and total unit sold is still weak, with the Pearson correlation becomes worse of 30.4%. Thus, I will not logarithm transform the total Advertising Spend values.

The following plot below is a line plot on total sold units and advertising spend across time.

ggplot(market_sale_final, aes(x = Date)) +
  geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) + 
  geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) + 
  scale_y_continuous(
    name = "The units sold",
    # Add a second axis and specify its features
    sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
  ) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light() +
  theme(
    axis.title.y = element_text(color = "red", size=13),
    axis.title.y.right = element_text(color = "blue", size=13)
  )

Again, there is some weak relationship between the total sold units and advertising spend across time, but modelling will require further confirm its relationship.

The following plot is to examine the distribution of the total units sold, to look for any possible skewness, which might require other data transformation.

hist(market_sale_final$Sales)

The total number of Sales is looking reasonably normal, i.e. bell-shaped; thus there is no need to apply any additional normalisation methods.

Descriptive summary

The following set of tables contains some summary statistics to allow me to have some initial inspection on the total units sold against of the variables of the datasets.

Total units sold versus years.

market_sale_final %>%
  group_by(Year) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Year Sales_mean Sales_sum
2016 89.48087 32750
2017 82.02740 29940
2018 76.81370 28037
2019 78.07671 28498
2020 66.04179 22124

The total units sold appears to decline from year to year.

Total units sold versus months

market_sale_final %>%
  group_by(Month) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Month Sales_mean Sales_sum
January 66.17419 10257
February 65.78169 9341
March 64.52258 10001
April 59.66000 8949
May 65.76129 10193
June 71.71333 10757
July 71.52258 11086
August 80.06452 12410
September 88.28000 13242
October 91.63226 14203
November 107.91333 16187
December 118.73387 14723

The most of total units sold appears to be at the later months of the year.

Total units sold versus days of the week

market_sale_final %>%
  group_by(Day) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Day Sales_mean Sales_sum
Monday 78.26848 20115
Tuesday 78.85156 20186
Wednesday 77.59766 19865
Thursday 78.82422 20179
Friday 79.33852 20390
Saturday 78.82101 20257
Sunday 79.21012 20357

The total units sold appears to very similar between the days of each week.

Total units sold versus having the Ultra Edition.

market_sale_final %>%
  group_by(Ultra_Edition) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Ultra_Edition Sales_mean Sales_sum
0 81.28650 74052
1 76.04181 67297

The total units sold appears to be similar with or without the Ultra Edition.

Total units sold versus having the Positive news.

market_sale_final %>%
  group_by(Positive_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Positive_News Sales_mean Sales_sum
0 78.22094 135948
1 93.12069 5401

The total units sold is higher with the Positive news.

Total units sold versus having the Negative news.

market_sale_final %>%
  group_by(Negative_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Negative_News Sales_mean Sales_sum
0 78.85618 140364
1 61.56250 985

The total units sold is higher without the Negative news.

Total units sold versus having the Competition

market_sale_final %>%
  group_by(Competition)  %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Competition Sales_mean Sales_sum
0 80.09143 111247
1 73.96069 30102

The total units sold appears to be similar with or without the Competition.

Total units sold versus having the COVID Lockdown

market_sale_final %>%
  group_by(COVID_Lockdown) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
COVID_Lockdown Sales_mean Sales_sum
0 80.24484 139947
1 26.96154 1402

The total units sold is significant higher without the COVID Lockdown.

Total units sold versus having the 24/7 call line.

market_sale_final %>%
  group_by(Phone_24) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Phone_24 Sales_mean Sales_sum
0 76.10649 95057
1 84.62888 46292

The total units sold appears to be higher when there is 24/7 call line available.

Modelling using linear regression

Since the distribution of the total units sold is relatively normal with a belt-shaped curve, I will start by fitting a multiple linear regression model for the total units sold, as seen below. Using a multiple linear regression model is a good starting point, because it is relatively straightforward to fit the model and perform model diagnostics to decide whether a more complicated modelling technique is required. Further, linear regression analysis makes inference easy, i.e. interpreting which predictor variable has statistically significant effects or has the most influence on the target variable. Lastly, linear regression analysis can be used to make predictions about the value of the target variable based on the values of the predictor variables.

Fitting the model

linear_reg_fit <-
  lm(
    Sales ~ Ad_Spend + (Year_num * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Competition + Phone_24 + 
      Positive_News + Negative_News,
    data = market_sale_final
  )

ANOVA table of the initial model

anova(linear_reg_fit)
## Analysis of Variance Table
## 
## Response: Sales
##                      Df Sum Sq Mean Sq   F value    Pr(>F)    
## Ad_Spend              1 151427  151427 4567.3059 < 2.2e-16 ***
## Year_num              1  84876   84876 2560.0098 < 2.2e-16 ***
## Month                11 520587   47326 1427.4418 < 2.2e-16 ***
## COVID_Lockdown        1  43105   43105 1300.1339 < 2.2e-16 ***
## Ultra_Edition         1   1042    1042   31.4209 2.421e-08 ***
## Competition           1   5867    5867  176.9491 < 2.2e-16 ***
## Phone_24              1  12119   12119  365.5325 < 2.2e-16 ***
## Positive_News         1   9669    9669  291.6337 < 2.2e-16 ***
## Negative_News         1   7207    7207  217.3837 < 2.2e-16 ***
## Year_num:Month       11   3837     349   10.5219 < 2.2e-16 ***
## Year_num:Month:Day   72   1698      24    0.7114    0.9677    
## Residuals          1693  56130      33                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choose a model by AIC in a Stepwise Algorithm

linear_reg_fit_final <- step(linear_reg_fit) 
## Start:  AIC=6388.04
## Sales ~ Ad_Spend + (Year_num * Month)/Day + COVID_Lockdown + 
##     Ultra_Edition + Competition + Phone_24 + Positive_News + 
##     Negative_News
## 
##                      Df Sum of Sq    RSS    AIC
## - Year_num:Month:Day 72      1698  57829 6297.6
## - Competition         1         0  56131 6386.0
## - Ultra_Edition       1        62  56192 6388.0
## <none>                             56130 6388.0
## - Negative_News       1      7195  63325 6602.7
## - Positive_News       1      8542  64672 6640.5
## - Phone_24            1      8939  65070 6651.5
## - COVID_Lockdown      1     30613  86743 7167.8
## - Ad_Spend            1    136145 192276 8597.4
## 
## Step:  AIC=6297.57
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition + 
##     Competition + Phone_24 + Positive_News + Negative_News + 
##     Year_num:Month
## 
##                  Df Sum of Sq    RSS    AIC
## - Competition     1         0  57829 6295.6
## <none>                         57829 6297.6
## - Ultra_Edition   1        70  57899 6297.7
## - Year_num:Month 11      3837  61666 6391.0
## - Negative_News   1      7533  65361 6515.5
## - Phone_24        1      8970  66798 6554.5
## - Positive_News   1      8970  66799 6554.6
## - COVID_Lockdown  1     30820  88648 7062.8
## - Ad_Spend        1    140497 198326 8509.0
## 
## Step:  AIC=6295.57
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Year_num:Month
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                         57829 6295.6
## - Ultra_Edition   1        87  57916 6296.3
## - Year_num:Month 11      3858  61686 6389.6
## - Negative_News   1      7533  65362 6513.5
## - Positive_News   1      8971  66800 6552.6
## - Phone_24        1     14180  72009 6687.4
## - COVID_Lockdown  1     33773  91602 7119.7
## - Ad_Spend        1    140502 198330 8507.1

ANOVA table of the final model

anova(linear_reg_fit_final) 
## Analysis of Variance Table
## 
## Response: Sales
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## Ad_Spend          1 151427  151427 4624.336 < 2.2e-16 ***
## Year_num          1  84876   84876 2591.975 < 2.2e-16 ***
## Month            11 520587   47326 1445.266 < 2.2e-16 ***
## COVID_Lockdown    1  43105   43105 1316.368 < 2.2e-16 ***
## Ultra_Edition     1   1042    1042   31.813 1.973e-08 ***
## Phone_24          1  17964   17964  548.581 < 2.2e-16 ***
## Positive_News     1   9666    9666  295.195 < 2.2e-16 ***
## Negative_News     1   7211    7211  220.227 < 2.2e-16 ***
## Year_num:Month   11   3858     351   10.710 < 2.2e-16 ***
## Residuals      1766  57829      33                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model diagnostics

hist(linear_reg_fit_final$residuals)

plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final) )

# calculate the mean squared error
mse_linear <- 
  sum(market_sale_final$Sales - 
  predict.glm(linear_reg_fit_final,
           newdata = market_sale_final, 
           type = "response") ) ^2

The AIC and mean square error from this multiple linear regression model are 1.1394401^{4} and 2.1707956^{-13}, respectively. This histogram of the residual is relatively normal, and the residual plot is randomly scattered around the residual of zero.

Parameter estimates

summary(linear_reg_fit_final)  
## 
## Call:
## lm(formula = Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + 
##     Ultra_Edition + Phone_24 + Positive_News + Negative_News + 
##     Year_num:Month, data = market_sale_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7034  -3.9216  -0.1025   3.4740  26.5912 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.996e+03  8.309e+02   8.419  < 2e-16 ***
## Ad_Spend                 1.428e-03  2.181e-05  65.504  < 2e-16 ***
## Year_num                -3.442e+00  4.118e-01  -8.360  < 2e-16 ***
## MonthFebruary           -4.700e+03  9.452e+02  -4.972 7.27e-07 ***
## MonthMarch              -3.356e+03  9.330e+02  -3.597 0.000331 ***
## MonthApril              -6.489e+03  1.033e+03  -6.280 4.26e-10 ***
## MonthMay                -6.430e+03  9.516e+02  -6.756 1.91e-11 ***
## MonthJune               -6.999e+03  9.358e+02  -7.479 1.17e-13 ***
## MonthJuly               -7.448e+03  9.346e+02  -7.970 2.83e-15 ***
## MonthAugust             -7.815e+03  9.346e+02  -8.362  < 2e-16 ***
## MonthSeptember          -6.771e+03  9.418e+02  -7.190 9.55e-13 ***
## MonthOctober            -5.059e+03  9.340e+02  -5.416 6.91e-08 ***
## MonthNovember           -6.112e+03  9.427e+02  -6.483 1.16e-10 ***
## MonthDecember           -6.701e+03  1.142e+03  -5.866 5.32e-09 ***
## COVID_Lockdown1         -3.486e+01  1.085e+00 -32.115  < 2e-16 ***
## Ultra_Edition1           9.210e-01  5.638e-01   1.633 0.102545    
## Phone_241                1.101e+01  5.292e-01  20.810  < 2e-16 ***
## Positive_News1           1.271e+01  7.680e-01  16.552  < 2e-16 ***
## Negative_News1          -2.188e+01  1.443e+00 -15.167  < 2e-16 ***
## Year_num:MonthFebruary   2.329e+00  4.684e-01   4.972 7.27e-07 ***
## Year_num:MonthMarch      1.663e+00  4.623e-01   3.598 0.000330 ***
## Year_num:MonthApril      3.216e+00  5.121e-01   6.280 4.25e-10 ***
## Year_num:MonthMay        3.188e+00  4.716e-01   6.760 1.87e-11 ***
## Year_num:MonthJune       3.471e+00  4.637e-01   7.485 1.12e-13 ***
## Year_num:MonthJuly       3.695e+00  4.631e-01   7.978 2.65e-15 ***
## Year_num:MonthAugust     3.880e+00  4.631e-01   8.379  < 2e-16 ***
## Year_num:MonthSeptember  3.367e+00  4.667e-01   7.215 7.97e-13 ***
## Year_num:MonthOctober    2.522e+00  4.628e-01   5.449 5.79e-08 ***
## Year_num:MonthNovember   3.051e+00  4.672e-01   6.531 8.53e-11 ***
## Year_num:MonthDecember   3.347e+00  5.662e-01   5.912 4.05e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.722 on 1766 degrees of freedom
## Multiple R-squared:  0.9356, Adjusted R-squared:  0.9345 
## F-statistic: 884.3 on 29 and 1766 DF,  p-value: < 2.2e-16
sum_stats <- summary(linear_reg_fit_final)

Note that this multiple linear regression model has an adjusted R-square of 0.9345135, which means this model can explain 93.45% of the information, thus a very accurate model.

There are some interesting insights has shown from this model:

  • For every $1000 spent on the advertising the total daily units sold is increased by 1.43,

  • For every one year increase, the average total daily units sold is increase by 3.44,

  • On the other hand, December is the best performing month in terms of total daily units sold, 6700.65 more compared to January.

  • COVID Lockdown has been shown to reduce the total daily units sold by 34.86.

  • Positive coverage has been shown to increase the total daily units sold by 12.71,

  • Negative coverage has shown to reduce the total daily units sold by 21.88

  • Having the 24/7 call line increase the total daily units sold by 11.01

  • Finally, the final multiple linear regression model suggests there are no statistically significant effects on the total units sold on:

    • with or without the ultra edition,
    • having competition and
    • between the days of each week.

Modelling using Poisson regression

Poisson regression is often used for modelling count data. Thus, below is the modelling results using the Poisson regression analysis. I used the identity link function here, as there is no need to perform any additional transformation on the target variable from the above inspection on the distribution.

Fitting the model

poisson_reg_fit <-
  glm(
   Sales ~ Ad_Spend + (Year_num * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Competition + Phone_24 + 
      Positive_News + Negative_News,
    data = market_sale_final,
    family = poisson(link = "identity")
)

ANOVA table of the initial model

car::Anova(poisson_reg_fit, type = 3) 
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sales
##                    LR Chisq Df Pr(>Chisq)    
## Ad_Spend            1753.23  1  < 2.2e-16 ***
## Year_num              17.16  1  3.438e-05 ***
## Month                 55.86 11  5.385e-08 ***
## COVID_Lockdown       645.92  1  < 2.2e-16 ***
## Ultra_Edition          0.41  1     0.5218    
## Competition            0.01  1     0.9039    
## Phone_24             109.69  1  < 2.2e-16 ***
## Positive_News        102.92  1  < 2.2e-16 ***
## Negative_News         96.66  1  < 2.2e-16 ***
## Year_num:Month        56.13 11  4.808e-08 ***
## Year_num:Month:Day    23.81 72     1.0000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choose a model by AIC in a Stepwise Algorithm

poisson_reg_fit_final <- step(poisson_reg_fit)
## Start:  AIC=12024.14
## Sales ~ Ad_Spend + (Year_num * Month)/Day + COVID_Lockdown + 
##     Ultra_Edition + Competition + Phone_24 + Positive_News + 
##     Negative_News
## 
##                      Df Deviance   AIC
## - Year_num:Month:Day 72   777.31 11904
## - Competition         1   753.52 12022
## - Ultra_Edition       1   753.91 12023
## <none>                    753.50 12024
## - Negative_News       1   850.16 12119
## - Positive_News       1   856.43 12125
## - Phone_24            1   863.20 12132
## - COVID_Lockdown      1  1399.42 12668
## - Ad_Spend            1  2506.74 13775
## 
## Step:  AIC=11903.95
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition + 
##     Competition + Phone_24 + Positive_News + Negative_News + 
##     Year_num:Month
## 
##                  Df Deviance   AIC
## - Competition     1   777.32 11902
## - Ultra_Edition   1   777.79 11902
## <none>                777.31 11904
## - Year_num:Month 11   832.53 11937
## - Negative_News   1   878.02 12003
## - Positive_News   1   885.49 12010
## - Phone_24        1   887.32 12012
## - COVID_Lockdown  1  1428.28 12553
## - Ad_Spend        1  2586.82 13712
## 
## Step:  AIC=11901.96
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Year_num:Month
## 
##                  Df Deviance   AIC
## - Ultra_Edition   1   777.98 11901
## <none>                777.32 11902
## - Year_num:Month 11   833.35 11936
## - Negative_News   1   878.06 12001
## - Positive_News   1   885.50 12008
## - Phone_24        1   953.41 12076
## - COVID_Lockdown  1  1499.11 12622
## - Ad_Spend        1  2587.15 13710
## 
## Step:  AIC=11900.61
## Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + Phone_24 + 
##     Positive_News + Negative_News + Year_num:Month
## 
##                  Df Deviance   AIC
## <none>                777.98 11901
## - Year_num:Month 11   833.78 11934
## - Negative_News   1   878.94 12000
## - Positive_News   1   886.18 12007
## - Phone_24        1   966.16 12087
## - COVID_Lockdown  1  1507.78 12628
## - Ad_Spend        1  2589.46 13710

ANOVA table of the final model

car::Anova(poisson_reg_fit_final, type = 3)
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sales
##                LR Chisq Df Pr(>Chisq)    
## Ad_Spend        1811.48  1  < 2.2e-16 ***
## Year_num          36.31  1  1.681e-09 ***
## Month             55.55 11  6.138e-08 ***
## COVID_Lockdown   729.81  1  < 2.2e-16 ***
## Phone_24         188.19  1  < 2.2e-16 ***
## Positive_News    108.20  1  < 2.2e-16 ***
## Negative_News    100.97  1  < 2.2e-16 ***
## Year_num:Month    55.80 11  5.523e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model diagnostics

hist(poisson_reg_fit_final$residuals)

plot(fitted(poisson_reg_fit_final), resid(poisson_reg_fit_final) )

# calculate the mean squared error
mse_poisson <- 
  sum(market_sale_final$Sales - 
  predict.glm(poisson_reg_fit_final,
           newdata = market_sale_final, 
           type = "response") ) ^2

The AIC (1.1900615^{4}) and mean square error (1.2061961^{-17}) from the Poisson regression model are slightly better than the estimates from the linear regression above (AIC = 1.1394401^{4} and mean square error = 2.1707956^{-13}). This histogram of the residual is also relatively normal, and the residual plot is randomly scattered around the residual of zero. Thus, we will use the Poisson regression for our final prediction.

Parameter Estimates

summary(poisson_reg_fit_final)  
## 
## Call:
## glm(formula = Sales ~ Ad_Spend + Year_num + Month + COVID_Lockdown + 
##     Phone_24 + Positive_News + Negative_News + Year_num:Month, 
##     family = poisson(link = "identity"), data = market_sale_final)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.07349  -0.47085  -0.00356   0.39677   2.68826  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              6.342e+03  1.041e+03   6.090 1.13e-09 ***
## Ad_Spend                 1.423e-03  3.445e-05  41.314  < 2e-16 ***
## Year_num                -3.118e+00  5.159e-01  -6.044 1.51e-09 ***
## MonthFebruary           -4.662e+03  1.332e+03  -3.501 0.000464 ***
## MonthMarch              -3.444e+03  1.300e+03  -2.649 0.008084 ** 
## MonthApril              -6.375e+03  1.378e+03  -4.628 3.70e-06 ***
## MonthMay                -6.319e+03  1.329e+03  -4.755 1.98e-06 ***
## MonthJune               -6.912e+03  1.352e+03  -5.113 3.18e-07 ***
## MonthJuly               -7.512e+03  1.359e+03  -5.527 3.26e-08 ***
## MonthAugust             -7.721e+03  1.397e+03  -5.527 3.26e-08 ***
## MonthSeptember          -6.705e+03  1.453e+03  -4.614 3.95e-06 ***
## MonthOctober            -5.087e+03  1.450e+03  -3.509 0.000450 ***
## MonthNovember           -6.109e+03  1.543e+03  -3.959 7.54e-05 ***
## MonthDecember           -6.935e+03  2.006e+03  -3.457 0.000546 ***
## COVID_Lockdown1         -3.469e+01  1.180e+00 -29.386  < 2e-16 ***
## Phone_241                1.102e+01  8.003e-01  13.773  < 2e-16 ***
## Positive_News1           1.232e+01  1.245e+00   9.893  < 2e-16 ***
## Negative_News1          -2.135e+01  1.909e+00 -11.184  < 2e-16 ***
## Year_num:MonthFebruary   2.310e+00  6.598e-01   3.501 0.000464 ***
## Year_num:MonthMarch      1.707e+00  6.443e-01   2.650 0.008057 ** 
## Year_num:MonthApril      3.160e+00  6.826e-01   4.629 3.68e-06 ***
## Year_num:MonthMay        3.133e+00  6.584e-01   4.758 1.95e-06 ***
## Year_num:MonthJune       3.428e+00  6.699e-01   5.117 3.10e-07 ***
## Year_num:MonthJuly       3.726e+00  6.735e-01   5.533 3.15e-08 ***
## Year_num:MonthAugust     3.834e+00  6.923e-01   5.538 3.05e-08 ***
## Year_num:MonthSeptember  3.334e+00  7.201e-01   4.631 3.64e-06 ***
## Year_num:MonthOctober    2.536e+00  7.183e-01   3.530 0.000415 ***
## Year_num:MonthNovember   3.049e+00  7.646e-01   3.988 6.66e-05 ***
## Year_num:MonthDecember   3.463e+00  9.943e-01   3.483 0.000495 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 11657.40  on 1795  degrees of freedom
## Residual deviance:   777.98  on 1767  degrees of freedom
## AIC: 11901
## 
## Number of Fisher Scoring iterations: 4
sum_stats <- summary(poisson_reg_fit_final)

There are some interesting insights has shown from this Poisson regression model:

  • For every $1000 spent on the advertising the total daily units sold is increased by 1.42,

  • For every one year increase, the average total daily units sold is increase by 3.12,

  • On the other hand, December is the best performing month in terms of total daily units sold, 6935.1 more compared to January.

  • COVID Lockdown has been shown to reduce the total daily units sold by 34.69.

  • Positive coverage has been shown to increase the total daily units sold by 12.32,

  • Negative coverage has shown to reduce the total daily units sold by 21.35

  • Having the 24/7 call line increase the total daily units sold by 11.02

  • Finally, the final multiple linear regression model suggests there are no statistically significant effects on the total units sold on:

    • with or without the ultra edition,
    • having competition and
    • between the days of each week.

Final Prediction results for December of 2020

Exacting the final predicted sales for December of 2020 with the 95% confidence intervals using the final Poisson regression model.

poisson_reg_predict <- 
  predict.glm(poisson_reg_fit_final,
           newdata = dec_ad_final, 
           type = "response",
           se.fit = TRUE) 

dec_ad_final$Sales <- round(poisson_reg_predict$fit)

dec_ad_final$Sales_max <- with(poisson_reg_predict, fit + 1.96*se.fit) |> round()
dec_ad_final$Sales_min <- with(poisson_reg_predict, fit - 1.96*se.fit) |> round()

Table of the daily total sales for December of 2020.

dec_ad_final %>% 
  select(Date, Sales, Sales_min, Sales_max) %>% 
  knitr::kable()
Date Sales Sales_min Sales_max
2020-12-01 120 115 124
2020-12-02 116 112 121
2020-12-03 112 107 116
2020-12-04 116 111 121
2020-12-05 135 130 140
2020-12-06 105 101 110
2020-12-07 122 117 127
2020-12-08 117 113 122
2020-12-09 128 123 132
2020-12-10 107 102 111
2020-12-11 110 105 115
2020-12-12 124 119 128
2020-12-13 135 130 140
2020-12-14 113 108 118
2020-12-15 111 107 116
2020-12-16 120 115 125
2020-12-17 116 112 121
2020-12-18 109 104 114
2020-12-19 117 112 122
2020-12-20 132 127 137
2020-12-21 109 104 114
2020-12-22 137 132 142
2020-12-23 115 110 119
2020-12-24 128 123 132
2020-12-25 105 100 110
2020-12-26 124 120 129
2020-12-27 106 101 111
2020-12-28 105 101 110
2020-12-29 146 141 151
2020-12-30 123 118 127
2020-12-31 118 113 123

The Poisson regression model predicts the total unit sales in December 2020 is 3681 with 95% confidence intervals of 3533 and 3828. Thus, the model suggests that the target sale of 3900 units in December 2020 is unlikely to meet.

The plot below is an interactive plot that contains both the historical (green colour-coded) and forecast (red colour-coded) sales with 95% confidence intervals. The user can hover the mouse cursor over the plot to see the actual estimates and highlight a session of the plot to zoom-in.

market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecast"

market_sale_final$Sales_max <- 
  market_sale_final$Sales_min <- 
  market_sale_final$Sales

g <- 
  market_sale_final %>% 
  bind_rows(dec_ad_final) %>% 
  mutate(Type = factor(Type)) %>% 
  ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
  geom_path() +
  geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()


plotly::ggplotly(g) %>% plotly::hide_legend()

Notes

An alternative method is using a time-series-based analysis with the forecast R package, which can better handle autocorrelation, seasonality, and other temporal patterns in the data. However, applying the method within the forecast R package requires the dataset to be converted to a Time-Series (ts) object. This process can be complicated, especially if we need to include other predictors such as “Total Advertising Spend” in the model for inference and prediction.

In addition, I have also attempted to use the eXtreme Gradient Boosting (XGBoost) model to see if I can obtain a model with an even smaller mean square error to ensure the total sales prediction is accurate and robust, as this method has become a popular method on constructing a predictive machine learning model. However, I have found that my linear regression model gave much better mean square error estimates for this particular dataset. Thus, I have not include it in the final report. Please find its implementation in the 1_modelling.R file of the Rcode folder.